Replacement Text Case Conversion

Some applications can insert the text matched by the regex or by capturing groups converted to uppercase or lowercase. The Just Great Software applications allow you to prefix the matched text token \0 and the backreferences \1 through \99 with a letter that changes the case of the inserted text. U is for uppercase, L for lowercase, I for initial capitals (first letter of each word is uppercase, rest is lowercase), and F for first capital (first letter in the inserted text is uppercase, rest is lowercase). The letter only affects the case of the backreference that it is part of.

When the regex (?i)(Helló) (Wórld) matches HeLlÓ WóRlD the replacement text \U1 \L2 \I0 \F0 becomes HELLÓ wórld Helló Wórld Helló wórld.

Perl String Features in Regular Expressions and Replacement Texts

The double-slashed and triple-slashed notations for regular expressions and replacement texts in Perl support all the features of double-quoted strings. Most obvious is variable interpolation. You can insert the text matched by the regex or capturing groups simply by using the regex-related variables in your replacement text.

Perl’s case conversion escapes also work in replacement texts. The most common use is to change the case of an interpolated variable. \U converts everything up to the next \L or \E to uppercase. \L converts everything up to the next \U or \E to lowercase. \u converts the next character to uppercase. \l converts the next character to lowercase. You can combine these into \l\U to make the first character lowercase and the remainder uppercase, or \u\L to make the first character uppercase and the remainder lowercase. \E turns off case conversion. You cannot use \u or \l after \U or \L unless you first stop the sequence with \E.

When the regex (?i)(helló) (wórld) matches HeLlÓ WóRlD the replacement text \U\l$1\E \L\u$2 becomes hELLÓ Wórld. Literal text is also affected. \U$1 Dear $2 becomes HELLÓ DEAR WÓRLD.

Perl’s case conversion works in regular expressions too. But it doesn’t work the way you might expect. Perl applies case conversion when it parses a string in your script and interpolates variables. That works great with backreferences in replacement texts, because those are really interpolated variables in Perl. But backreferences in the regular expression are regular expression tokens rather than variables. (?-i)(a)\U\1 matches aa but not aA. \1 is converted to uppercase while the regex is parsed, not during the matching process. Since \1 does not include any letters, this has no effect. In the regex \U\w, \w is converted to uppercase while the regex is parsed. This means that \U\w is the same as \W, which matches any character that is not a word character.

Boost’s Replacement String Case Conversion

Boost supports case conversion in replacement strings when using the default replacement format or the “all” replacement format. \U converts everything up to the next \L or \E to uppercase. \L converts everything up to the next \U or \E to lowercase. \u converts the next character to uppercase. \l converts the next character to lowercase. \E turns off case conversion. As in Perl, the case conversion affects both literal text in your replacement string and the text inserted by backreferences.

where Boost differs from Perl is that combining these needs to be done the other way around. \U\l makes the first character lowercase and the remainder uppercase. \L\u makes the first character uppercase and the remainder lowercase. Boost also allows \l inside a \U sequence and a \u inside a \L sequence. So when (?i)(helló) (wórld) matches HeLlÓ WóRlD you can use \L\u\1 \u\2 to replace the match with Helló Wórld.

PCRE2’s Replacement String Case Conversion

PCRE2 supports case conversion in replacement strings when using PCRE2_SUBSTITUTE_EXTENDED. \U converts everything that follows to uppercase. \L converts everything that follows to lowercase. \u converts the next character to uppercase. \l converts the next character to lowercase. \E turns off case conversion. As in Perl, the case conversion affects both literal text in your replacement string and the text inserted by backreferences.

Unlike in Perl, in PCRE2 \U, \L, \u, and \l all stop any preceding case conversion. So you cannot combine \L and \u, for example, to make the first character uppercase and the remainder lowercase. \L\u makes the first character uppercase and leaves the rest unchanged, just like \u. \u\L makes all characters lowercase, just like \L.

In PCRE2, case conversion runs through conditionals. Any case conversion in effect before the conditional also applies to the conditional. If the conditional contains its own case conversion escapes in the part of the conditional that is actually used, then those remain in effect after the conditional. So you could use ${1:+\U:\L}${2} to insert the text matched by the second capturing group in uppercase if the first group participated, and in lowercase if it didn’t.

R’s Backreference Case Conversion

The sub() and gsub() functions in R support case conversion escapes that are inspired by Perl strings. \U converts all backreferences up to the next \L or \E to uppercase. \L converts all backreferences up to the next \U or \E to lowercase. \E turns off case conversion.

When the regex (?i)(Helló) (Wórld) matches HeLlÓ WóRlD the replacement string \U$1 \L$2 becomes HELLÓ wórld. Literal text is not affected. \U$1 Dear $2 becomes HELLÓ Dear WÓRLD.